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Abstract 


/ 

'  Ifr  this  paper.,,-A^^  extend^ the  class  of  problems  tliat  can  be  eflectivcly  com¬ 
piled  by  parallelizing  compilers.  This  is  accomplished  with  the  doconsider  con¬ 
struct  which  would  allow  those  cqmpilers  to  paralleli/'.e  many  problen)s  in  wliid) 
substantial  loop-level  parallolysm  is  available  but  rannf)t  be  detected  by  sfamlard 
compile-time  analysis.  We''tlescril)e  an<!  exi>erimenl;dlv  analyze  mechanism';  nsml 
to  parallelize  the  work  required  for  these  types  of  loops.  In  each  of  these  method^ 
a  new  loop  structure  is  produced  by  modifying  the  looj)  to  be  parallelized.  We  also 

- present  the  rules  by  which  thc,se  loop  transformations  may  be  automated  in  order 

that  they  be  included  in  language  compilers.  The  main  application  area  of  our  ro- 
search  involves  problems  in  scientific  computations  and  engineering.  The  workload 
— useiTTir OUT  experiments  inclu^  a  mixture  of  real  problems  as  well  as  synllu'li- 
callv  generated  inputs.  ITom  tmf  extensive  tests  on  I  Ik-  I'mcore  Multimax/.'t'itt.  we 
have  reached  the  conclu.sion  that  for  the  types  of  workloads  we  have’  investigated, 
self-execution  almost  always  performs  better  than  pn'-se  licduling.  I'nrlher.  I  hi'  im¬ 
provement  in  performance  that  accrues  as  a  result  of  global  topological  sort  ing  of 
indices  as  opposed  to  the  less  expen.sive  local  sorting,  is  not  very  significant  in  tin- 
case  of  self-execution.  ^ - - 

^This  research  was  supported  by  the  U.S.  Office  of  Naval  Research  under  Grant  N00014- 
86-K-0310,  the  United  States  Air  Force  Office  of  Scientific  Research  under  Contract  No. 
AFOSR  Grant  No.  AFOSR  88-0117,  and  the  National  Aeronautics  and  Space  Administra¬ 
tion  under  NASA  Contract  No.  NASl-18107  while  the  first  author  was  in  residence  at  the 
Institute  for  Computer  Applications  in  Science  and  Engineering  (ICASE),  NASA  Langley 
Research  Center,  Hampton,  VA  23665. 


1  Introduction 


There  exi^t  many  prolhems  in  which  substantial  parallelism  is  available  but  where 
the  paiallelism  cannot  be  exploited  using  the  two  principal  concurrent  loops  described 
in  the  literatvire:  doall  and  doacross  [14]  [Gj.  doall  loops  do  not  impose  any  ordering 
on  loop  iterations  while  doacross  loops  impose  a  partial  execution  order  in  the  sense 
that  some  of  the  iterations  are  forced  to  wait  for  the  partial  or  complete  execution  of 
some  preA’ious  iterations.  We  propose  a  new  type  of  loop,  the  doconsider  construct. 
The  doconsider  loop  allows  loop  iterations  to  be  ordered  in  new  ways  that  preserve 
dependency  relations  and  increase  concurrency.  Often,  these  sorts  of  index  reorderings 
can  be  done  at  very  low  cost  and  can  have  substantial  benefits. 

A  v-ariety  of  systems  for  restructuring  loops  and  reordering  indices  hav'e  been  de¬ 
veloped  in  the  functional  language  and  systolic  array  generation  communities.  These 
methods  rely  on  being  able  to  detect  the  existence  of  uniform  or  quasi-uniform  recur¬ 
rence  relations  at  compile-time.  The  dependency  vectors  characterizing  these  recurrence 
relations  are  examined  and  a  new,  hopefully  more  efficient  way  of  traversing  the  depen¬ 
dency  graph  is  found.  We  are  able  to  handle  loops  whose  inter-iteration  dependency 
may  be  complex  or  where  the  dependences  may  be  determined  by  variables  whose  val¬ 
ues  are  not  available  until  program  execution  begins.  The  methods  we  present  here 
set  up  the  framework,  at  compile-time,  for  performing  a  loop  dependency  analysis  and 
produce  a  restructured  loop  that  is  reorderd  on  the  basis  of  the  information  obtained 
from  the  dependency  analysis.  The  actual  dependency  analysis  is  performed  at  the  start 
of  program  execution.  We  will  show  that  this  kind  of  analysis  can  be  performed  very 
quickly  and  has  very  substantial  payoffs. 

Symbolic  transformations  are  used  to  produce:  (1)  scheduling  procedures  that  re¬ 
order  and  repartition  index  sets  of  loops  and  (2)  executors  or  transformed  versions  of 
source  code  loop  structures.  These  transformed  loop  structures  carry  out  the  calcula¬ 
tions  planned  in  the  scheduling  procedures.  An  executor  may  be  regarded  as  a  doacross 
loop  that  executes  loop  iterations  in  a  modified  order. 

The  sche<luling  mechanisms  we  explore  are  based  on  a  topological  sort.  The  index 
set  is  partitioned  into  di.sjoint  subsets  of  indices  or  wavefronts,  such  that  work  pertaining 
to  all  indices  in  a  wavefront  may  be  carried  out  in  parallel.  One  method  called  global 
scheduling,  [)erforms  a  topological  sort  of  index  set  and  assigns  indices  to  processors 
in  a  way  that  evenly  partitions  the  work  in  each  wavefront.  In  each  processor,  indices 
are  scheduled  in  order  of  inentasing  wavefront  number.  The  other  method  called  lo¬ 
cal  scheduling,  starts  out  with  a  fixed  assignment  of  indices  to  processors  and  simply 
rearranges  the  local  ordering  of  those  indices  to  improve  parallelism. 

We  investigate  two  types  of  executors  in  which  indices  belonging  to  each  wavefront 
are  partitioiu'd  among  the  processors.  In  tlie  first  executor,  based  upon  j>re- scheduling, 
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global  synclu'onizations  ‘;oparate  consecutive  wavefronts.  In  the  second  executor,  whcih 
we  call  self-executing,  a  shared  array  is  used  to  indicate  whether  a  solution  variable  has 
been  calculated.  Global  s\-nchronizations  are  replaced  by  busy  waits  that  ensure  that 
needed  values  have  been  produced  before  those  values  are  used. 

We  investigate  the  performance  tradeoffs  that  characterize  the  different  scheduling 
and  execution  methods  we  propose.  The  investigation  uses  a  complete,  commercial 
sparse  matrix  solver  (PCGPAK  [4])  used  to  solve  a  range  of  linear  systems,  a  synthetic 
workload  is  also  employed.  We  first  clearly  delineate  the  performance  tradeoffs  between 
pre-scfi^duled  and  self-executing  loops.  To  fully  explain  the  performance  tradeoffs  be¬ 
tween  these  types  of  loops,  we  need  to  be  able  to  quantitatively  explain  the  performance 
we  are  observing.  We  present  a  set  of  experiments  and  analysis  able  to  account  for  how 
time  is  spent  in  the  two  different  kinds  of  loops. 

The  method  used  to  rearrange  the  index  set  of  the  loop  to  be  parallelized  will  de¬ 
termine  both  the  potential  performance  benefits  that  can  be  gained  and  the  overhead 
that  must  be  paid,  ^^'e  study  the  tradeoffs  between  local  and  global  index  set  scheduling 
and  conclude  that  for  self-executing  loops,  local  scheduling  appears  to  lead  to  multipro¬ 
cessor  perfoj  iuance  that  is  comparable  to  global  scheduling  in  problems  of  interest  at  a 
significantly  lower  overhead  cost. 

From  the  results  of  experiments,  we  have  reached  the  conclusion  that  for  the  types  of 
workloads  we  have  investigated,  self-execution  almost  always  performs  better  than  pre¬ 
scheduling.  Further,  the  improvement  in  performance  that  accrues  as  a  result  of  global 
topological  sorting  of  indices  as  opposed  to  the  less  expensive  local  sorting,  is  not  very 
significant  in  the  case  of  self-execution.  Thus,  we  are  left  with  a  2-dimensional  solution 
space,  as  dt'picted  in  Figure  1.  w’hich  pictorially  siimmarizes  the  findings  reported  in 
this  paper. 

The  rest  of  this  paper  is  organized  as  follows:  In  Section  2,  we  provide  simple  rules 
that  allow  the  transformation  of  certain  types  of  loops  into  different  parallel  forms. 
These  rules  ran  be  inserted  into  parallelizing  compilers,  extending  the  class  of  prob¬ 
lems  that  can  be  effectiwdy  compiled  for  parallel  machines.  We  describe  some  of  the 
related  research  in  Section  3.  A  simple  mathematical  model  which  captures  the  tradeoff 
between  load  balance  and  synchronization  costs  is  described  in  Section  4.  The  results 
of  multiprocessor  experiments  are  pre.sented  in  Section  5.  These  experiments  provide 
a  quantitati\e  performance  study  of  the  schedulers  and  executors  under  consideration. 
Finally,  we  summarize  our  findings  in  Section  6. 
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Performance  of  Scheduling  and  Sorting  Strategies 
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Figure  1;  Summary  of  Results 
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1:  do  i=l,n 

2:  x(i)  =  x(i)  +  b(i)*x(ia(i)) 

3 :  end  do 


Figun^  2:  A  siinplr  looj) 

2  The  Automated  Execution  System 

2.1  Motivation 

In  a  broad  sense,  modules  of  code  in  parallel  programs  are  either  compile-time  or  run¬ 
time  schcdulablo.  In  order  that  a  code  be  compile-time  schedulablc-.  it  needs  to  possess 
sufficient  iutbrination  so  that  the  compiler  is  able  to  extract  the  parallelism  and  map 
and  schedule  the  code,  e.g.,  doall  type  loops  in  Fortran[l5].  In  certain  other  types 
of  codes,  examination  of  run-time  data  is  absolutely  critical  in  order  to  detect  hidden 
parallelism.  We  have  been  interested  in  the  study  of  such  problems.  Within  this  class  of 
run-time  schedulablc  codes,  there  are  two  main  categories,  i.e.,  those  that  are  start-time 
schedulablc  and  those  that  are  not. 

Codes  are  start- time  schedulablc  if  all  data  dependences  are  resolved  before  the  pro¬ 
gram  begins  e.xecution  and  if  these  dependences  do  not  change  during  the  course  of 
the  comptitaf ion.  For  (odes  that  are  not  start-time  schedulable,  tlu'  data  dependences 
may  be  determined  by  functions  who.se  parameters  are  other  functions,  the  values  of 
which  are  only  computed  at  some  unknown  point  during  the  computation.  In  [11],  we 
present  self-execution  primitives  that  aid  greatly  in  the  on-the-fiy  detection  of  paral¬ 
lelism  in  such  prol)lems.  In  this  present  paper,  we  will  only  be  concerned  with  start-time 
sclK'dulable  problems. 

Standard  rechnicpies  develoi)ed  by  researchers  in  the  field  of  parallel  imperative  com¬ 
pilers  can  d<  rermine  when  the  data  structure  that  describes  the  dei)endency  relations 
is  not  changed  during  the  course  f)f  the  comitutation  [ij. 


2.2  Transformation  rules  for  automated  system 

In  this  section,  we  describe  the  ink's  by  which  an  automated  sj-mltolic  manipulator 
performs  souree  to  source  transformation  of  a  sequential  ii.ser  code  into  a  suitable  parallel 
version.  Th<  se  rules  can  lie  included  in  a  conventional  i);ir;dlelizing  compiler  so  that  the 
class  of  prolileins  that  can  i>e  handled  by  these  compilers  is  extended  to  include  those 
that  are  start -time  schedulable. 

A  loop  of  the  form  shown  in  Figure  2,  may  be  exectited  many  times  during  the  run- 
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1:  doconsider  i=l,n 
2:  x(i)  =  x(i)  +  b(i) *x(ia(i) ) 

> 


Figure  3:  An  annotated  loop 


1:  do  i*l,nlocal 

la;  isched  =  schedule(i) 

lb:  needed_index  =  ia(isched) 

2a:  if  (needed_index  >=  isched)  then 

2b:  x(isched)  =  xold(isched)  + 

b(isched)*xold(needed_index) ; 

else 

while  (ready (needed_index)  .ne.  COMPLETED))  end  while 
x(isched)  =  xold(isched)  + 

b(isched)*x(needed_index) ; 
ready (isched)  =  ready(isched) +1 ; 
endif 
enddo 


Figure  4;  A  Self-Executing  loop 


3a: 

3b: 

3c : 
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la:  do  i=l,nlocal 

lb:  isched  =  scheduled) 

Ic:  if  (isched  .eq.  NEWPHASE)  then 

Id:  call  global  synchronization 

else 

2a:  needed.index  =  ia(isched) 

2b:  if  (needed.index  .ge.  isched)  then 

2c:  x(isched)  =  xold(isched)  + 

b (isched) ♦xold (needed. index) 

else 

2d:  x(isched)  =  xold(isched)  + 

b( isched) ♦x (needed. index) 

endif 
endif 
enddo 


Figure  5;  A  Pre-scheduled  Loop 
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ning  of  a  giwii  program.  We  refer  to  tliis  program  as  simple.  Ihe  data  dependences 
lietween  the  element. s  of  x  are  determined  by  the  values  assigiu'd  during  program  exe¬ 
cution  to  th<'  data  structure  in.  A  vahu'  of  the  out('r  loop  iiuU'X  i,  /|  has  a  depf'iidencf^ 
on  (Uiother  \alue  of  tlu'  outer  loop  indc'x  1-2  if  the  conii)ut;itio!i  of  .r(/i  )  retpiii'es  x(i2). 

In  the  fii'^t  version  of  our  system,  user  programs  will  need  simple  annotations  which 
will  direct  the  compiler  to  invoke  its  run-time  parallelization  modules.  This  will  ap¬ 
ply  to  complete  parallel  hmguages  as  well  as  language  extensions  which  have  explicit 
parallel  sections.  We  propose  to  provide  language  extensions  using  constructs  such  as 
doconsider  and  f  orconsider,  depending  upon  the  language  being  extended.  An  an¬ 
notated  user  code  corresponding  to  Figure  2  is  shown  in  Figure  3.  These  constructs 
will  be  used  in  addition  to  the  doall  and  doacross  type  loops  already  provided  in  such 
systems.  Details  of  these  language  extensions  are  currently  being  finalized.  The  me¬ 
chanical  process  by  which  the  run-time  modules  are  invoked  is  described  in  [12].  Briefly 
however,  an  annotation  of  the  type  forconsider  will  generate  code  that  is  able  to  sort 
the  indices  on  a  processor  in  order  of  increasing  wavefront  number  ( dc'tails  of  this  sort¬ 
ing  procedure  are  provided  in  Section  2.3).  Next,  an  appropriate  transformation  will  be 
invoked  to  produce  an  executor  to  actually  run  the  code  using  the  newly  created  index 
ordering. 

The  example  code  shown  in  Figure  2  has  been  chosen  for  ea.se  of  explanation  of  the 
transformations  we  will  present  shortly.  In  the  system  that  we  an*  designing,  realistic 
codes  that  tend  to  be  much  more  complex  in  structure  can  and  will  be  handled. 

To  paralh'lize  such  loops,  the  method  we  use  is  as  follows;  We  first  partition  the 
indices  of  the  outer  loop  of  Figure  2  into  disjoint  sets  5,,  such  that  zow  substitutions  in  a 
set  S,  may  be  carried  out  independently.  To  obtain  the  sets  5,.  we  perform  a  topological 
sort  of  the  directed  acyclic  dependence  graph  G  that  describes  the  dependences  between 
the  outer  loop  indices.  Stage  k  of  this  sort  is  performed  by  placing  into  set  Sfc  all  indices 
of  G  not  pointed  to  by  graph  edges.  Following  this  all  edges  that  ('inanated  from  the 
indices  in  S\  are  removed.  The  elements  of  are  said  to  belong  to  wavefront  k.  A 
single  program  multiple  data  method  of  izroblem  decomposition  is  used;  the  wavefront 
information  is  used  to  prepare  a  schedule  of  outer  loop  indices  to  be  executed  by  each 
iprocessor. 

The  main  loop  in  Figure  4  corresjzonds  to  th<’  iiulices  assigned  to  this  processor  (line 
1).  The  key  [zoint  in  Figure  4  has  to  do  with  line  3a  and  the  while  loop  which  ensures 
that  an  index  is  never  used  until  it  has  bei'ii  computed.  Finally,  the  array  ready  is  used 
to  maintain  the  status  of  all  the  indices.  In  Figure  o.  we  depict  the  code  tran.sformed 
into  one  that  u.ses  barrier  .synchronization  at  the  eiul  o'f  each  pha.'-e.  Before  this  code 
is  executed,  it  is  assiimed  that  a  topological  sort  of  the  data  depemh'uces  is  performed 
and  the  end  of  a  phase  is  marked  by  a  special  flag  with  the  appropriate  index  on  every 
processor.  check  is  made  to  see  if  the  end  of  phase  is  reached  and  if  so,  a  call  is 
made  to  global  synchronization.  The  rest  of  the  code  is  self-explanatory.  It  should 
l)e  noted  here  that  we  first  partition  the  index  set.  Given  this  static  partition  each 
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doconsider  i=l,n 
t  emp  *  f  ( i ) 
do  j=l,m 

y(i)  =  y(i)+  temp+y (g(i , j ) ) 
enddo 
enddo 


Figure  6;  A  nested  loop 

processor  is  informetl  when  it  should  perform  work  associated  with  each  of  its  assigned 
indices. 


2.3  Efficient  Calculation  of  the  Topological  Sort 

The  schedule  of  outer  loop  indices  for  each  processor  can  be  obtained  by  global  schedul¬ 
ing,  assigning  indices  to  processors  in  a  way  that  evenly  partitions  the  work  in  each 
wavefront.  In  each  processor,  indices  are  scheduled  in  order  of  increasing  wavefront 
number.  Alternately  using  local  scheduling,  one  begins  with  a  fixed  assignment  of  in¬ 
dices  to  processors  and  uses  the  wavefront  information  to  simply  rearrange  the  local 
ordering  of  those  indices  to  improve  parallelism. 

The  loops  in  the  source  code  can  be  transformed  to  assign  a  wavefront  number  to 
each  loop  index.  For  instance,  a  looj)  of  the  form  depicted  in  Figure  6  is  converted  to 
the  transformed  loop  in  Figure  7.  Since  the  wavefront  number  for  each  index  is  one 
plus  the  maximum  of  the  wavefront  numbers  of  the  indices  on  which  it  depends,  one 
can  simidy  sweep  sequentially  through  the  indices  and  calculate  the  wavefront  for  each 
index.  Figure  7  depicts  a  version  of  the  topological  sorting  procedure.  This  process 
I)roduces  an  array  maxwfy.  as  shown  in  Figure  7.  Array  maxwfy  must  then  be  sorted  to 
produce  an  execution  schedule  for  the  processors. 

On  the  Mnltimax/320.  the  sequential  execution  time  required  for  i^oth  these  opera¬ 
tions  tends  to  be  slightly  less  than  the  cost  of  a  single  triangular  solve  using  the  same 
matrix.  The  topological  sort  can  l>e  parallelized  to  a  degree  Ity  strii'ung  consecutive  in¬ 
dices  across  the  processors  and  by  using  busy  waits  to  assure  that  variable  \7ilues  have 
been  prodticed  before  being  used. 

Wdiile  local  scheduling  is  almost  completely  parallel! zable,  it  is  not  clear  how  one 
would  efficiently  parallelize  global  scheduling.  The  interproces.sor  coordination  required 
for  this  rather  fine  grained  computation  appears  to  be  prohibitive  in  the  absence  of  a 
fetch  and  add  jirimitive. 
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do  i=l,n 

temp  =  f(i) 
mywf  =  0 
for  do  j=l,m 

mywf  =  max(maxwfy(g(i, j)) ,mywf ) 
enddo 

maxufy(i)  =  mywf; 
enddo 


Figure  7:  Computation  of  Wavefronts 

We  now  provide  a  short  stepwise  description  of  the  automated  i^rocedure  which  takes 
as  input  a  code  of  the  type  shown  in  Figure  6  and  restructures  it  into  a  suitable  parallel 
version.  St('ps  1  through  3  are  performed  at  compile-time,  while  steps  4  and  5  are 
performed  at  run-time. 


1.  The  indices  of  the  computation  are  logically  distributed  among  the  processors  in 
some  sfjecified  manner. 

2.  A  topological  sort  code  is  then  generated  by  the  compiler,  during  program  execu¬ 
tion  this  code  which  determines  the  wavefront  number  of  each  index  (Figure  7). 

3.  The  loop  in  Figvire  G  is  transformed  into  a  self-executing  or  pre-scheduled  ver¬ 
sion.  with  the  optional  insertion  of  the  code  that  repartitions  indices  among  the 

proce.S'ors. 

4.  At  stait  of  execution.  th<‘  wavefront  numbers  are  computed  and  the  indices  are 
sorted  oil  the  basis  of  tliese  wavefronts.  The  indices  may  or  may  not  be  reparti¬ 
tioned. 

o.  The  actual  computation  is  now  performed  by  each  processor  on  its  assigned  subset 
of  indices,  using  one  of  the  executors  that  have  been  generated,  as  in  step  3. 


3  Related  Work 


The  execution  of  parallel  tasks  using  self-scheduling  has  received  considerable  attention. 
Lusk  and  Overbeek  [10]  implement  a  seif- scheduled  mechanism  to  dynamically  allocate 
work  to  proees.sors.  While  this  method  has  the  advantage  of  simplicity,  many  of  the 
more  corni)!' x  dynamic  jjrobleins  that  wc-  are  interested  in  solving  do  not  seem  to  be 
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t’asih  '  ii  uml.irad  in  this  framework.  Polychronopoulos  and  Kuck  [IG]  are  concerned 
with  till  efficii-ut  execution  of  doall  type  loops  using  run-time  sidf  srhi'duling.  While  the 
I'Hicaey  of  self-scheduling  for  certain  clas.ses  of  jiroblems  on  shared  memory  machines 
is  demonstrati  <1  in  that  paper,  more  comiilex  problems  which  cannot  be  formulated  in 
a  doall  .setting  are  not  sttidied.  Tang  and  Yew[l9]  describe  a  mechauisin  to  exi'cute 
multiple  nested  doall  loojis.  u.siug  self-scheduling.  It  is  shown  that  for  certain  types  of 
problems,  self-scheduling  is  more  efficient  than  pre-scheduling  using  static  assignment  of 
loe  iterations  to  processors.  Krothapalli  and  Sadayappan[9]  describe  a  method  which  is 
able  to  remove  anti-  and  output-dependences,  by  performing  an  analysis  of  the  reference 
pattern  generated  and  using  multiple  copies  of  variables  in  order  to  simulate  a  single 
assignment  language.  Cytron[Gj  discusses  the  problem  of  how  to  schedule  doacross  loops 
with  lexically  backward  dependences  by  introducing  delays  in  appropriate  places  in  the 
code  to  ensure  correctness.  A  linear  programming  problem  is  formulated  and  solved  in 
order  to  calculate  the  minimum  delays. 

Loop  restructuring  has  been  used  successfully  to  allow  parallelizing  compilers  to 
improve  parallelism  and  enhance  performance  in  memory  hierarchies  [14],  [15], [2], [7], 

To  our  knowleilge,  there  has  been  no  work  in  the  automatic  detection  of  run-time 
parallelism  along  with  the  restructuring  of  such  loops  for  efficient  scheduling. 

Numerical  methods  for  solving  sparse  triangular  systems  have  however  employed 
closely  related  schemes  to  reorder  operations  to  increase  available  parallelism,  [3], [IS], [5], [S], [17]. 

.\s  far  as  performance  improvement  is  concerned,  we  show  the  efficacy  of  our  tech- 
nirpies.  From  a  programming  language  standpoint,  we  believe  that  user  codes  for  parallel 
machines  ought  not  to  include  the  details  of  scheduling  and  mapping.  This  has  several 
advantages;  program  portability  will  certainly  become  more  feasible  and  program  dcvel- 
ojiment  time  will  decrea.se.  We  believe  that  robust  transformations  which  automatically 
restructure  programs  to  exploit  parallelism  will  aid  in  reducing  the  effort  required  to 
])rogram  j)aiallel  machines. 


4  Description  and  Analysis  of  Model  Problems 

4.1  Model  Problems 

There  are  se\<  lal  wa\>  to  generate  the  w(»rkload  needed  to  test  the  various  aspects  of 
till'  system.  In  our  expi’riments.  these  model  i>roblems  come  from  two  main  sources,  i.e., 
the  solutions  of  sparse  linear  systems  arising  from  a  variety  of  partial  differential  eipia- 
fions  using  preconditioned  Krylov  methods  and  from  parameterized  sjmthetic  workload 
generators.  \\e  examine  in  particular  detail  the  solution  of  sparse  triangular  systems 
obtained  through  incomplete  factorizations  of  matrices  arising  from  discretizations  of 
the  partial  d.ifferential  equations  in  question  on  a  variety  of  two  and  three  dimensional 
meshes.  A  docription  of  the  problems  solved  are  found  in  Appendix  1.  The  solution  of 
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SI:  do  i*l,n 

y(i)  =  rhs(i) 

S2:  do  j=ija(i) ,ija(i+l)-l 

y(i)  *  y(i)  -  a(j)*y(ija(j)) 
end  do 
end  do 


Figure  8:  Triangular  Solve 

these  sparse  triangular  systems  accounts  for  a  large  fraction  of  the  sequential  execution 
time  of  these  linear  solvers.  The  dependences  encountered  in  solving  these  systems  in¬ 
hibit  the  parallelization  of  the  outer  loop  of  row  substitutions  (SI  in  Figure  8).  Typically 
the  number  of  non-zero  elements  in  a  row  is  too  small  to  allow  efficient  parallelization 
of  the  inner  loop  (S2  in  Figure  8). 

We  also  present  overall  performance  results  for  a  commercial  preconditioned  Krylov 
solver  PCGPAK  which  was  completely  parallelized.  Parallelization  was  carried  out 
using  either  the  pre-scheduled  or  self-executing  constructs  presented  here.  Details  of 
how  the  parallelization  was  carried  out  are  presented  in  Appendix  2,  a  much  more 
detailed  account  of  the  PCGPAK  results  is  presented  in  [4]. 

For  a  moK'  general  source  of  matrices,  we  utilize  a  a  simple  workload  generator 
which  is  able  to  incorporate  the  important  parameters  such  as  locality  of  communica¬ 
tions.  volunu’  of  communication  betweeiz  nodes  etc,  in  the  generation  of  matrices.  The 
synthetic  workload  generator  should  have  the  following  properties: 

•  The  outjmt  of  the  generator  should  approximately  be  able  to  describe  approxi¬ 
mately  some  of  the  real  problems  we  encounter,  implying  that  the  workload  is  not 
completely  random. 

•  It  should  be  easy  to  vary  the  input  parameters  of  the  workload  generator  to  test 
certain  canonical  features  of  the  sofware  sj’steni. 

Clearly,  having  such  a  generator  will  i)rovide  faster  turnaround  time  for  performance 
testing  and  la-cause  it  will  be  ea.sy  to  vary  the  ])arameters,  the  testing  of  the  software 
modules  will  Ije  more  robust. 

Most  of  the  problems  that  we  havi'  been  interested  in  solving  have  the  following 
charactertistics: 

•  The  computation  is  defined  over  a  reasonably  large  index  set  of  values. 

•  There  ('xists  a  phas<'  structure  implicit  in  the  computation  such  that  not  all  indices 
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can  !)('  at  the  same  instant  beraust'of  certain  data  deix  ndences  that  must 

be  satisfied. 

•  Usuall}',  iudie^'s  inteiaet  with  otlier  indices  that  are  close  l)y,  whei  '  closeness  is  a 
h'atiire  of  the  ])hysical  j)robl('m  Ix  ing  solv<'d. 

In  the  first  implementation  of  the  workload  generator,  we  hav^e  made  the  following 
approximations:  The  input  domain  consists  of  a  2-dimensional  mesh  of  points  whose 
connections  have  yet  to  be  established.  Each  point  in  the  mesh  is  a  unique  index  of 
the  computation  to  be  performed  using  that  mesh.  The  points  are  numbered  using 
their  natural  ordering.  We  use  two  probability  distributions  to  model  the  workload;  one 
determines  the  total  number  of  dependencj-  links  between  an  index  and  other  indices  in 
the  domain,  the  other  is  used  to  determine  the  locality  of  the  links  to  be  forged. 

The  number  of  indices  that  any  given  index  needs  to  communicate  its  output  with 
is  given  by  a  Poisson  density  function,  with  parameter  A.  The  Poisson  approximation 
is  reasonable  because  several  physical  phenomena  can  be  modeled  using  this  random 
variable.  The  density  function  for  this  random  variable  is  defined  as  follows: 

Pi  =  Pr[.Y  =  i]  =  li\,  i  >  0 

Depending  upon  the  value  of  A,  the  probability  density  can  be  varied  to  suit  the 
problem  at  hand.  Furtlu'r.  the  Poisson  density  function  is  often  used  as  an  approxi¬ 
mation  to  the  Binomial  density  function,  which  is  normally  much  more  expensive  to 
compute. 

It  is  known  that  many  problems  in  nature  have  the  property  that  spatial  regions  of 
the  problem  domain  tend  to  interact  more  intensely  with  adjacent  or  close-by  regions. 
.A.  probability  density  function  that  possesses  such  a  property  is  the  geometric  density 
function.  A  random  variable  X  has  a  range  1,  2,...  and  density  function 

Pr[A'  =  ;'=(!  —  p)p\  V/  >  0 

The  intc’rpretation  that  we  ascribe  to  this  density  function  is  as  follows:  If  an  index 
value  k  i.^  to  rcunmunicate  with  an  index  that  is  i  distant  from  it.self.  the  set  of  indices 
that  are  i  units  away  (vising  the  Manhattan  metric)  from  index  k  is  determined.  One  of 
these  indices  (if  any]  is  selected  in  order  to  make  a  connection  with  k.  This  process  of 
making  connections  is  continued  until  all  the  links  are  exhausted  for  each  index.  Thus, 
we  can  generate  a  fhita-dependency  matrix  using  the  mesh  generated  by  the  above 
procedurv'. 

In  the  following  subsection,  we  present  the  results  of  experiments  by  which  we  de¬ 
termine  the  pv'rformance  of  the  schedulers  and  executors  under  consideration. 
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Sorted  List  =  (1, 2, 8, 5, 9, 15, d, 10, 16, 22, 5, 1  1, 17, 23, 29, 

6,12,18,24,30,7,13,19,25,31,14,20,26,32 

21,27,33,28,34,35) 


Figure  9;  Assignment  of  Indices  to  Wavefronts 

4.2  Analysis  of  a  Model  Problem 

We  will  use  a  model  problem  to  illustrate  the  performance  difference  l>etweeii  usiiig  jiif'- 
scheduing  and  self-execution.  We  will  examine  this  l)y  estimating  the  time  that  would 
be  required  to  solve  a  lower  triangular  system  generated  by  the  zero  fill  factorization  of 
the  matrix  arising  from  a  rectangular  mesh  with  a  five  j)oint  template.  We  will  use  a  vi 
by  ;i  domain  and  p  <  min(m,7j)  processors.  We  will  explicitly  take  into  arronnt  only- 
floating  point  and  synchronization  related  computations.  In  Secticui  5  we  d('monstrate 
experimentally  that  these  as.suinj>tions  can  be  used  to  ])redict  innlti])rocessor  timings 
rather  accurately. 

We  assume  that  all  computations  required  to  solve  the  problem  would  r('((uiv'  time 
5  on  a  single  processor,  and  that  computation  of  each  i)oint  takes  time  Tp  --  .9/(?7j??); 
This  ignores  the  relatively  minor  disparities  caused  by  the  matrix  rows  represented  l)y 
points  on  the  lower  and  the  left  boundary  of  the  domain. 

To  understand  the  relative  performance  of  the  two  synchronization  mechanisms  on 
this  problem,  we  need  to  make  clear  how  the  indices  are  mapped  onto  tlu-  maehinc’s 
processors.  The  global  topological  sort  produces  a  list  of  indices  sorted  by  wavi'front. 
The  points  in  a  wavefront  arise  from  an  anti-diagonal  strip  of  the  domain.  For  instance, 
in  Figure  9,  we  depict  a  five  by  seven  domain  with  th('  points  in  each  wavefront  linked 
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Figure  10:  Assignment  of  Indices  to  Processors 

by  an  antidiagonal  stripe.  When  the  points  in  the  domain  are  naturally  ordered  the 
topological  sort  produces  a  list  L  that  picks  points  on  the  anti-diagonal  strip  going  from 
the  upper  riglit  point  in  the  strip  to  the  lower  left  point.  This  corresponds  to  arranging 
the  points  in  each  wavefront  in  order  of  increasing  iinU'X  number. 

The  indices  in  L  arc  assigned  in  a  wrapped  manner,  as  dei)ic(ed  for  the  example 
problem  in  Figure  10.  When  pro- scheduling  is  used,  the  computation  is  divided  into 
phases  separated  by  global  synchronizations. 

A  brief  inspection  of  Figure  9  makes  it  clear  that  n  +  n?  —  1  phases  are  required  to 
complete  the  computation.  Define  MC{j)  as  the  maximum  number  of  points  computed 
by  any  processor  during  phase  j.  The  computation  time  retjuired  to  comi)Iete  pliase 
j  is  equal  to  TpMC{j).  The  computation  time  required  to  compk-te  the  pn'Mem  is 
consequently 

n+m-l 

E  T,MC{j). 

j=i 

We  now  proceed  to  calculate  MC{j).  During  i)hnse  j,  a  total  of  ;  stri])s  must  be 
computed  when  1  <  J  <  min(7n,7i).  Since  the  strips  are  assigned  in  !v  \vrap]>ed  manner, 

Mco)=  r-i. 

P 
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When  niia(?n,n)  ^  <  n  +  m  —  niin(rrj,n),  a  total  of  inin(/;?,;))  strips  must  be 
completed  during  phase  j.  Due  to  the  \vrai)ped  assignment  of  Ktrii)s  to  processors, 

P 

Finally  when  ?i  +  m  -  min(m,  n)  <  j  <  n  +  m  -  1,  a  total  of  v  +  m  -  j  strips  must  be 
computed  during  phase  j  so 

A/co)  =  r^^i^i. 


The  computation  time  required  to  complete  the  problem  is 

n+m-l 

rc  =  T,  Y:  -wcu) 

j=i 

^  nun(m,n)  — 1  •  .  \ 

=  - (  r~1  +  (n  +  m  -  2min(?Ti,n)  +  IjC— + 

mn  p  p 

j=m+n-min(m,n)  +  l  ^ 


By  assumption,  the  sequential  time  to  solve  the  problem  is  S  =  vnnTp.  The  estimated 
efficiency  Eop>  we  could  achieve  in  the  absence  of  any  source  of  inefficiency  unrelated  to 
load  imbalance  would  be  or 

‘C 


min(m,n)  — 1  •  /  \ 

z2i  [“1  +  ~  2min(m,  u)  +  1)[ - - ^]  + 


.=1  P 

n+m  — 1 


p 


.?=rn+n— nun(  m.n)-f  1 


(2) 

We  can  derive  a  simph^r  expression  that  approximates  Eopi  by  estimating  the  total 
amount  of  time  all  processors  spend  idle  due  to  load  imbalance.  Let  m  and  h  be  equal 
to  the  largest  multiples  of  p  that  are  smaller  than  ui  and  rt  respectively.  During  any 
pha.se  j  <  min(m,n)  —  1  when  j  is  not  a  multiple  of  p,  there  are  p  —  j  mod  p  processors 
idle.  When  /  is  a  multiple  of  p,  no  processors  are  idle.  Thus  the  cumulative  proces.sor 
idle  time  for  j  <  min(TU,  u)  ~  f  i®' 

rpmin(m,n)^f^,(/  -  1)  Tp  minfm,  r0(/>  -  D 
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Through  similar  reasoning,  the  sum  of  the  processor  idle  time  for  the  last  min(m,  n)  -  1 
phases  is  the  same. 

Between  the  first  and  last  min(m,n)  —  1  phases  if  inin(??/,  ii)  is  etjual  to  p,  no  time 
is  weisted,  otherwise  the  time  lost  per  phase  is 

Las  =  Tp{p  —  min(m,  n j.  od p) 

We  esm  use  the  above  considerations  to  estimate  the  cumulative  time  wasted  by  all 
processors,  and  use  this  estimate  to  calculate  the  following  approximate  expression 
which  gives  E„pt  = 

mn 

tun  +  mini  m,fi){p  —  1)  +  (m  +  n  +  1  —  2 min(m,  n))((p  —  min(7n,  n))  mod  p 

Much  of  the  load  imbalance  we  observe  above  can  be  corrected.  The  failure  to 
balance  is  essentially  an  end-effect;  e.g.,  the  phase  has  p  -f  1  work  units  with  equal 
computational  demands,  but  only  p  processors  are  available.  In  [13]  we  rearrange  the 
global  synchronizations  in  a  way  that  obtains  a  tradeoff  between  improved  load  balance 
and  the  costs  of  the  global  synchronizations.  While  that  mechanism  is  shown  to  be 
advantageous  for  some  problems,  rearrangement  of  the  global  synchronizations  does 
require  an  extra  stage  of  preprocessing. 

Self-execution  also  eliminates  these  end  effects.  In  the  model  i)roblem  we  are  pre¬ 
senting  here,  we  can  see  that  any  given  row  substitution  in  a  wavefront  requires  only  two 
solution  values  from  the  previous  wavefront.  It  is  possible  to  to  concurrently  compute 
row  substitutions  in  consecutive  wavefronts  provided  that  we  observe  dependences.  This 
is  taken  care  of  naturally  since  the  self-execution  b^isy  wait  synchronization  mechanism 
ensures  that  dependences  are  in  fact  observed. 

Figure  11  depicts  the  data  dependences  between  row  substitutions  in  the  model 
problem.  Assume  that  solution  values  are  available  for  indices  in  list  L  through  the 
index  corresi>onding  to  wavefront  u>,  domain  strip  s.  All  indices  in  L  up  to  the  index 
correspondin'!;  to  wavefront  -f  1,  domain  strip  s  will  have  their  dependences  satisfied 
and  can  be  concurrently  calculated. 

We  can  df-rive  an  e.xpression  for  Eopt  for  the  self-executing  case.  Assuming  again 
that  the  time  required  to  compute  the  sohitions  is  identical  for  all  indices,  only  the  first 
and  last  p  —  1  wavefronts  contribute  to  load  imbalance.  By  arguments  similar  to  those 
made  for  the  ])re-scheduling  case,  the  cumulative  processor  idle  time  is  p(p  —  1).  Eopt  is 
thus  given  by 


mn 

m  n  -f  p(p  -  1 ) 


(5) 


If  Tapnch  is  the  cost  of  a  single  global  synchronization,  the  time  required  to  synchronize 
the  pre-scheduled  computation  is  T,y„ch  times  the  numlxT  of  .synchronization  needed, 
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Figure  11:  Data  Dependences  between  Indices 

j.e.  T)ynch{n  +  m  —  1).  The  self-executing  program  ensures  global  synchronization  by 
incrementing  elements  of  a  shared  array  when  variables  are  calculated.  As  fU-sciilied 
earlier,  the  shared  array  is  checked  to  find  out  which  variables  have  been  sobc-d  for 
at  any  given  time.  The  cost  of  incrementing  the  arrny  elements  is  given  by  7’„c7nT?., 
where  Tine  is  the  cost  of  incrementing  a  single  array  element.  Since  coinpu(iiig  each 
solution  value  is  a.ssumed  to  need  two  other  solution  values,  the  cost  of  t  becking  the 
array  elements  is  estimated  by  2Tc7,ecfcmTi,  where  Tchrek  is  the  cost  of  checking  a  shared 
memory  location.  Note  that  we  have  accounted  separately  for  idle  time  flue  to  load 
imbalance;  we  assume  here  that  we  only  have  to  verify  that  a  required  sfilution  value  is 
available. 

By  modifying  the  above  expressions  for  Eopt  to  include  the  synchronization  over¬ 
heads,  we  derive  an  expression  for  the  ratio  between  the  time  required  to  soh-e  the 
model  problem  using  pre-scheduling  to  that  required  for  solving  the  problem  using  '^elf- 
execution,  In  the  expression  below,  Rsynch  =  ,  Rinc=  and  Rrheck-  - . 

^  ip  ip  /p 


=  [mn  +  {L„  2Ltr)/Tp  R,y„ch{n  +  m  -  l)][inu(l  +  Rinc  +  2Rcheck)  +  pip  -  1)]  ’ 

For  large  n  and  m  =  p-t- 1,  we  expect  to  find  that  slightly  under  half  of  the  processors 
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(6) 


arc  idle  (lu<'  to  load  imbalance.  The  above  ratio  in  the  limit  of  large  rii  becomes 

2p  T  Rsynch 

(P  +  1)(1  +  Riiic  +  ^Rcheck) 

The  abov('  expression  suggests  that  the  self-executing  program  might  be  expected  to 
perform  substantially  better  than  the  pre-scheduled  program  as  long  as  it  is  relatively 
inexpensive  to  check  and  to  increment  shared  memory.  In  practice,  one  often  obtains 
triangular  systems  that  have  a  relatively  large  number  of  phases  with  modest  amounts 
of  work  to  be  performed  in  each  phase,  as  we  will  see  in  Section  5.  The  limit  derived 
above  sheds  some  insight  into  these  cases. 

For  m  =  n  the  situation  is  quite  different;  as  n  increases  we  obtain  the  ratio 


1  +  Rinc  +  -Rcheck 

If  the  problem  size  increases  in  both  dimensions,  the  relative  contribution  of  the  end 
effect  load  imbalances  diminish.  The  amount  of  computation  to  be  performed  grows 
as  mn  while  the  number  of  global  synchronizations  needed  grow  as  n  +  m  —  1.  In 
this  case,  pre-scheduling  is  preferable  to  self-execution.  In  shared  memory  machines 
with  fast  access  to  shared  memory,  there  will  be  only  a  small  difference  between  the 
pre-scheduled  and  self-executing  times. 

Many  problems  of  practical  interest  are  somewhat  less  sparse  than  the  model  prob¬ 
lem  analyzed  here.  When  such  a  problem  is  to  be  solved  using  many  processors,  we 
may  exjject  dramatic  performance  differences  between  pre-scheduled  and  self-executing 
programs.  To  illustrate  this,  we  present  the  rather  extreme  (from  our  point  of  view) 
example  of  solving  a  n  by  n  dense  triangular  matrix  having  unit  diagonals  using  n  —  1 
processors.  A.ssume  Tsarpy  is  the  time  required  for  a  floating  point  multiply  and  add.  The 
computation  time  required  to  solve  this  system  using  self-execution  is  Tjaxpv(rr  —  1).  No 
parallelism  at  all  is  obtained  when  one  attempts  to  solve  such  a  system  when  row  sub¬ 
stitutions  ar<'  separated  by  global  synchronizations;  each  row  substitution  forms  its  own 
wavefront.  The  sequential  computation  time  and  the  pre-scheduled  computation  time 
are  both  .  Calculated  only  on  the  basis  of  load  balance,  the  self-executing 

efficiency  E, .  t  is  while  the  pre-scheduled  Eopt  is 


5  Experimental  Results 

5.1  Multiprocessor  Timings 

The  experina  ntal  results  in  this  section  are  organized  in  the  following  manner:  We 
describe  the  performance  of  PCGPAK  using  the  self-executing  and  pre-scheduled  ex¬ 
ecutors.  Next,  we  perform  a  detailed  analysis  of  the  various  timing  losses  that  occur  in 
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Table  1:  Self- Execution  vs  Pre- Scheduling  for  PCGPAK  16  Processors  of  the  Encore 
Multimax 


Test  Problem 

Pre-Scheduled 

Self-executing 

Sort  Time 

Time 

Efficiency 

Time 

Efficiency 

Time 

SPEl 

1.48 

14 

0.83 

25 

0.03 

SPE2 

2.49 

24 

1.63 

37 

0.26 

SPE3 

3.84 

35 

3.11 

44 

0.11 

SPE4 

1.04 

17 

0.66 

26 

0.03 

SPE5 

6.18 

62 

5.89 

65 

0.10 

5-PT 

3.11 

33 

2.50 

41 

0.14 

9-PT 

6.31 

42 

4.76 

56 

0.25 

7-PT 

4.90 

57 

5.41 

52 

0.19 

L5-PT 

41.76 

50 

37.93 

56 

1.40 

L9-PT 

64.01 

54 

54.74 

63 

0.80 

L7-PT 

23.20 

62 

23.51 

61 

0.79 

the  code.  This  detailed  analysis  does  not  use  PCGPAK,  instead  we  use  a  separate  set 
of  programs  written  to  study  the  issues  we  are  investigating.  The  pre-scheduled  execu¬ 
tor’s  performance  is  compared  using  local  and  global  sorting  of  the  indices  based  upon 
their  wavefronts.  Because  we  see  that  the  performance  of  the  pre-scheduled  executor 
is  almost  always  worse  than  that  of  the  self-executing  version,  we  restrict  some  of  our 
later  studies  to  the  self-execution  system. 

In  the  case  of  the  synthetic  workload,  a  matrix  represented  as  65-4-3  implies  the 
discretization  of  a  65*65  mesh  where  the  average  number  of  edges  leaving  a  mesh  point 
equals  4,  with  a  Poisson  distribution,  and  the  average  distance  between  connections 
being  3,  with  a  geometric  distribution. 


5.1.1  Pre-scheduled  vs.  self-execution 

Two  versions  of  parallel  PCGPAK,  a  Krylov  space  solver  [4],  were  produced.  In  the 
first  version,  the  triangular  solves  and  the  numeric  factorization  were  implemented  using 
self-scheduling;  in  the  second  the  triangular  solves  and  numeric  factorization  were  pre¬ 
scheduled.  In  both  cases,  the  index  set  of  the  outer  loop  of  the  appropriate  procedure  was 
partitioned  in  a  wrapped  manner.  The  timings  were  done  on  an  Encore  Multimax/320 
with  13  megahertz  APC/02  boards  and  version  2.1  of  the  FORTRAN  compiler. 

In  Table  1  we  present  time  required  to  solve  the  test  problems  for  the  pre-scheduled 
and  self-executing  versions  of  PCGPAK,  along  with  the  parallel  efficiencies  achieved. 
Parallel  efficiency  is  defined  as  the  ratio  between  the  time  required  to  solve  a  problem 
by  an  optimized  sequential  version  of  PCGPAK  and  the  product  of  the  time  required  on 
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the  same  prol>lem  by  the  multiprocessor  code  multiplied  by  the  number  of  processors. 
The  self-executing  version  of  the  program  yields  the  highest  efficiencies  and  the  lowest 
times  for  all  test  problems  except  the  small  and  large  problems  using  the  seven  point 
operator  (7-PT  and  L7-PT).  For  many  of  the  problems,  the  timing  differences  in  favor 
of  the  self-executing  version  of  the  code  ai'e  quite  substantial.  In  the  SPE  problems  1,2 
and  4  the  self-executing  version  PCGPAK  completes  in  less  than  70  percent  of  the  time 
required  by  the  pre-scheduled  version. 

Overheads  in  the  self-executing  version  of  the  program  arise  from  the  need  to  check 
and  update  the  shared  array  which  indicates  whether  needed  solution  variables  or  pivot 
rows  have  been  computed.  In  the  pre-scheduled  version  of  the  program,  overheads  arise 
from  the  cost  of  global  synchronizations.  Overheads  aside,  it  is  possible  to  show  that 
the  parallelism  available  from  the  self-executing  version  of  the  program  is  always  better 
than  in  the  pre-scheduled  version.  Measured  efficiencies  for  all  problems  except  7-PT 
and  L7-PT  favor  the  self-executing  version  of  the  program. 

In  section  5.1.2,  we  will  explain  the  differing  relative  performance  between  the 
pre-scheduled  and  self-executing  versions  of  PCGPAK.  This  will  be  done  by  showing 
that  for  the  test  problems,  we  can  account  in  a  quantitative  manner  for  the  timing 
differences  between  pre-scheduled  and  self-executing  versions  of  the  triangular  solves. 
We  also  present  in  Table  1,  the  times  required  to  perform  the  topological  sort  for  each 
of  the  test  problems.  In  each  of  these  test  problems,  the  time  required  to  perform  the 
topological  sort  required  for  global  index  scheduling  was  quite  small,  compared  to  the 
total  execution  time.  Since  the  scheduling  had  only  to  be  performed  once  and  was 
amortize<l  over  a  substantial  number  of  iterations,  even  the  relatively  expensive  global 
scheduling  did  not  represent  a  troublesome  overhead.  The  cost  of  performing  both  global 
and  local  scheduling  will  be  examined  in  much  more  detail  in  the  following  sections. 


5.1.2  Where  Does  the  Time  Go 

We  performed  an  operation-count  based  analysis  of  the  parallelism  that  could  be  ob- 
taiiu'd  given  a  particular  assignment  of  indices  to  processors.  The  analysis  made  the 
assuini)tion  that  the  load  lialance  could  be  characterized  .solely  by  the  distribution  and 
scheduling  of  the  floating  point  operations.  The  efficiency  estimated  on  this  basis  will  be 
called  the  »yinholically  e.*timatcd  efficiency.  In  tables  2  and  3  respectively,  are  depicted 
symbolically  estimated  efficiencies  for  self-executing  and  pre-scheduled  triangular  solves. 
The  estimates  presented  are  for  some  of  the  previously  discussed  test  problems  on  16 
processors.  The  parallelism  we  anticipate  obtaining  through  the  use  of  self-executing 
code  is  better,  frequently  by  a  wide  margin. 

The  efficiencies  predicted  by  operation  count  based  analysis  are  substantially  higher 
than  those  we  saw  in  Section  5.1.1.  This  is  not  surprising  since  the  symbolically  esti¬ 
mated  efficiencies  do  not  take  into  account  a  number  of  important  sources  of  overhead. 
We  will  demonstrate  that  we  can  account  for  these  overhead  sources  in  a  systematic 
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w!iy  iiiul  iis('  tli<’s<'  ()V('rln';i<l  values  to  accurately  j>r<'(lict  the  multiprocessor  timings  in 
both  sclf-cx('cuting  and  pre-scheduled  versions  of  a  standalone  program  for  paralleling 
a  sparse  lower  triangular  solv<\ 

In  Table  2  and  3  we  have  the  actual  multiprocessor  timings  on  16  processors  for 
lower  triangular  solves  arising  from  the  incompletely  factored  test  jiroblem  matrices. 
An  optimized  sequential  version  of  the  program  was  also  timed  for  each  of  the  lower 
triangular  systems.  We  depict  sequential  times  divided  by  the  product  of  the  number 
of  processors  used  and  the  symbolically  estimated  efficiencies  (timings  are  denoted  by 
1  PE  seq.  in  tables  2  and  3). 

The  estimates  of  multiprocessor  times  obtained  in  the  estimate  above  are  quite 
optimistic.  To  take  into  account  the  extra  operations  that  had  to  be  executed  by 
the  parallel  version  of  the  program,  we  timed  the  multiprocessor  program  on  a  single 
processor.  Tables  2  and  3  show  the  single  processor  parallel  code  timing  divided  by  the 
product  of  the  number  of  processors  used  and  the  symbolically  estimated  efficiencies  ( 1 
PE  Par.).  In  performing  this  calculation,  we  tacitly  assume  that  load  balance  effects 
of  the  distribution  of  work  in  the  multiprocessor  program  can  still  be  estimated  by 
taking  into  account  only  the  distribution  ofg  loatinripoint  calculations.  In  effect,  we  are 
assuming  that  the  effect  of  the  extra  operations  required  in  the  multiprocessor  program 
could  be  exidained  by  simply  adding  a  fixed  overhead  to  each  floating  point  operation. 

Contention  for  resources  such  as  shared  memory  and  bus  access  can  cause  ineffi¬ 
ciencies  that  are  not  accounted  for  by  the  above  estimates.  We  ran  a  version  of  the 
multiprocessor  code  designed  to  simulate  the  memory  and  communications  access  pat¬ 
terns  of  the  actual  program.  This  version  of  the  code  is  designed  to  have  a  perfect  load 
balance.  When  executed  on  P  processors,  this  program  executes  the  schedules  a  total 
of  P  times.  Each  processor  ends  up  executing  the  schedules  assigned  to  all  processors  so 
that  each  processor  ends  up  computing  the  work  associated  with  all  of  the  indices  in  the 
problem.  The  time  required  for  this  program  to  complete  is  called  the  rotating  processor 
time  because  each  processor  takes  on  the  work  assigned  to  each  other  processor  with 
control  being  shifted  in  a  rotating  fashion. 

No  synchronization  takes  place  in  this  version  of  the  codes.  The  shared  array  reads 
and  writes  u.sed  in  the  busy  wait  coordination  in  the  self-executing  code  still  take  place 
but  the  program  is  modified  so  that  no  waiting  actually  has  to  occur.  In  the  pre¬ 
scheduled  vf'ision  of  the  program,  ghffial  synchronizations  are  not  employed.  In  the 
aV)sence  of  resource  contention,  we  would  expect  that  the  time  required  for  the  above 
computation  would  be  very  close  to  the  time  sjamt  running  the  parallel  version  of  the 
codes  on  a  single  processor. 

In  the  self-executing  case,  the  time  estimate  obtained  from  dividing  the  rotating  pro¬ 
cessor  time  by  the  i)roduct  of  the  number  of  processors  and  the  symbolically  estimated 
efficiency  gi\es  a  very  close  estimate  of  the  actually  observed  multiprocessor  time  {Ro¬ 
tating  Estimate).  For  the  ])re-scheduled  case,  we  must  include  the  time  required  for  the 
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global  synchronizations  to  obtain  an  accurate  prediction  of  the  actual  multiprocessor 
time  [Rotatunj  Estimate.  +  Darner).  When  this  is  done,  wv  get  very  good  ('stimate 
of  the  pre-scheduled  multiprocessor  timings.  In  using  the  symbolically  estimated  effi¬ 
ciencies,  we  again  make  the  tacit  assumption  that  the  extra  overhead  (except  the  global 
synchronizations)  could  be  explained  by  adding  a  fixed  overhead  to  all  floating  point 
operations.  Note  that  while  more  sophisticated  models  of  overhead  are  certainly  possi¬ 
ble  and  may  be  desirable  in  some  cases;  we  find  here  that  these  simple  techniques  and 
assumptions  adequately  explain  the  timings  we  observe. 

The  sources  of  the  timing  differences  between  pre-scheduled  and  self-executing  pro¬ 
grams  becomes  more  apparent  in  comparing  tables  2  and  3.  For  the  5-PT  and  SPE2  test 
problems,  the  difference  in  the  load  balance  obtainable  through  the  use  of  pre-scheduled 
and  self-executing  codes  is  large  enough  that  the  I  PE  Seq  time  for  the  pre-scheduled 
code  is  greater  than  the  Parallel  Time  for  the  self-executing  program.  Even  if  we  had 
a  hypothetical  pre-scheduled  code  with  no  overheads  except  for  load  imbalance,  that 
code  would  still  be  less  efficient  than  the  self-executing  program.  Recall  that  the  pre- 
scheduled  program  uses  global  synchronizations  in  between  each  phase  but  does  not 
need  to  write  into  a  shared  array  to  ket'p  track  of  which  variables  have  been  calculated. 
In  a  reasonaI)ly  large  problem  such  as  7-PT  where  there  are  relatively  few  global  syn¬ 
chronizations.  the  or-erhead  required  for  pre-scheduling  is  relatively  small.  Since  little 
loss  due  to  load  imbalance  is  seen  for  7-PT,  we  are  able  to  see  that  pre-scheduling  gives 
a  slightly  faster  timing. 

In  Table  2  we  depict  the  time  required  for  a  doacross  loop  to  execute  each  tri¬ 
angular  solve.  We  see  that  the  doacross  loop  is  consistently  less  efficient  than  either 
the  presclu'duled  or  self-executing  loops.  For  example  in  the  SPE5  problem,  the  self¬ 
executing  solve  requires  23.4  milliseconds,  the  prescheduled  solve  (in  Table  3)  required 
29.0  milliseconds  and  the  doacross  version  of  the  solve  took  45.0  milliseconds. 

Recall  that  the  self-executing  loop  is  a  doacross  loop  with  a  reordered  index  set. 
We  expect  that  the  doacro.ss  loop  will  exhibit  less  concurrency  than  the  self-executing 
1'  op.  Since  the  doacross  loop  does  not  have  to  perform  array  references  to  access  the 
reordered  index  set,  we  expect  that  the  doacross  will  also  be  accompanied  by  smaller 
overheads.  The  results  of  measurements  not  presented  here  confirm  that  while  the 
concurrency  obtained  from  doacross  loops  was  quite  limited,  doacross  loop  execution 
was  accompanied  by  h’ss  overhead.  On  the  Multimax/320  measurments  indicate  that 
accessing  the  reordered  index  .set  is  relatively  expensive  and  hence  the  performance 
diflerencts  b<'tween  the  doacross  looj).>  and  the  reorclerc'd  loops  is  attenuated  to  some 
degree. 

5.1.3  Timing  Projections 

Since  we  can  accurately  account  for  the  execution  time  in  the  Encore  Multimax/320,  it 
is  reasonable  to  make  some  timing  projections.  These  projections  make  the  assumption 


Table  2:  Parallel  Tniie  and  Estimates  for  Self-Executing  Triangular  Solves 


Test 

Problem 

Phases 

Symbolic 

Efficiency 

Parallel 

Time 

1  PE 
S('q. 

Doacross 

Time 

SPE2 

CO 

0.89 

20.7 

17.9 

15.0 

33.9 

66 

23.4 

21.6 

18.5 

15.3 

45.0 

5-PT 

124 

0.95 

18.7 

17.6 

14.5 

12.2 

37.1 

IKBilUH 

311 

0.97 

57.9 

57.1 

51.7 

97.5 

1  7-PT 

58 

0.98 

_ 1 

56.3 

_ 1 

57.6 

45.1 

38.1 

84.1 

Table  3:  Parallel  Time  and  Estimates  for  Pre-Sclieduled  Triangular  Solves 


Symbolic 

Efficiency 

Parallel 

Time 

Rotating 
Estimate 
+  Barrier 

Rotating 

Estimate 

1  PE 
Parallel 

1  PE 
Seq. 

SPE2 

60 

0.52 

32.7 

32.8 

30.0 

26.6 

25.6 

SPE5 

66 

0.70 

29.0 

29.5 

26.4 

22.6 

20.8 

5-PT 

124 

0.61 

31.1 

31.0 

25.2 

20.2 

18.8 

msm 

311 

0.78 

80.3 

83.9 

C3^5 

56.7 

53.9 

58 

56.2 

56.3 

53.7 

44.0 

39.8 

that  the  costs  of  synchronization,  the  costs  from  the  extra  operations  required  to  run 
the  parallel  ^•ersions  of  the  codes  and  the  costs  due  to  contention  do  not  change  with  the 
number  of  i)i()cessors.  If  the  load  balance  were  perfect,  the  Best  efficiencies  in  Table  4 
woidd  Ije  ol>tained. 

The  estimate  of  non  load  balance  related  loss  [Btst  in  table  4)  obtained  from  timings 
on  IG  proc<’ss()is  is  clearly  not  valid  for  larger  machines  if  we  simply  add  more  processors 
to  the  current  machine.  The  estimate  is  reasonable  if  we  assume  that  the  capabilities  of 
I  he  shared  resources  such  as  interprocessor  communication  are  enginef'red  to  scale  with 
the  size  of  the  machine. 

If  is  cleaily  easier  to  assure  j)erformance  characteristics  that  scale  with  the  number 
of  processors  if  one  ch'sigus  machines  with  distributed  memory  or  a  hierarchical  shared 
memory.  We  are  currently  extemling  such  projections  to  those  types  of  machines,  that 
W(nk  is  Ijeycuid  the  scoj)e  of  this  paper  but  .some  discussion  of  that  issue  can  be  found 
in  [12]. 

In  Table  4.  we  pres«‘nt  efficiencies  for  IG  processors  and  projected  efficiencies  for  32 
and  64  processors.  The  projected  performance  of  the  pre-schedvde<l  programs  deteri¬ 
orates  much  more  rapid.ly  as  one  increases  the  number  of  processors.  This  difference 
is  driven  by  the  increasing  disparity  between  symbolically  estimated  efficiencies  in  the 
two  scheduling  methods.  The  differences  seen  in  the  Best  efficiencies  in  Table  4  reflect 
the  varying  relative  costs  of  global  synchronizations  and  array  writc's  in  problems  with 
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Table  4:  Estiuiatccl  Efficiencies  for  Larger  Machines 


Test 

Prol)lem 

Be 

S.E. 

St 

P.S. 

16  Pt 
S.E. 

ocessors 

32  Pr 
S.E. 

oerssors 

P.S. 

64  Pi 

S.E. 

ocessors 

P.S. 

SPE2 

75 

7S 

_ 

67 

40 

58 

25 

45 

12 

[■dUaM 

mm 

49 

56 

39 

46 

23 

5-PT 

65 

61 

52 

27 

55 

30 

34 

15 

7-PT 

66 

70 

65 

66 

64 

62 

■  -  -  j 

60 

55 

a-PT 

76 

64 

73 

52 

68 

26 

39 

12 

different  strnctures.  this  issue  was  discussed  in  Section  5.1.2. 


5.1.4  Effects  of  Local  Reordering 

In  Figure  12.  we  demonstrate  tlie  crucial  role  played  by  the  synchronization  mechanism 
in  determining  performance,  when  indices  arc  not  repartitioned  after  a  topological  sort. 
We  compare  the  estimated  efficiency  of  the  same  partition  and  schedule  using  global 
synchronizati(rn  and  self-executing  synchronization  in  a  matrix  generated  by  a  65  by 
65  point  mesh  using  a  5  point  stencil.  Indices  were  assigned  to  proces.sors  in  a  striped 
manner,  i.('.  for  P  processors  index  i  was  assigned  to  processor  i  modulo  P.  The 
'-clu'dule  wa.s  jiioduced  by  performing  a  topological  sort  and  scheduling  indices  in  each 
pha  .se  in  ord<n  of  increasing  index  numi>er.  We  can  see  that  the  results  obtained  through 
rlic  use  of  global  synchronization  vary  wildly  with  the  number  of  processors  used.  This 
is  u,.  Vrsttmdable  when  we  realize  that  the  poor  performance  arises  from  the  poor 
distribution  <.f  indices  among  processors  in  any  given  phase.  .411  work  assigned  to  a  phase 
must  be  comph'ted  before  any  work  corresponding  to  the  next  pha.se  can  commence. 
Ofti'ii,  many,  if  not  all  the  indices  in  a  i)ha.se  get  a.ssigncd  to  a  single  j)rocessor,  resulting 
in  scfiuentia!  execution  for  that  phase.  We  saw  this  effect  to  a  significant  degree  in  all 
of  the  ju-oblfins  we  examined  although  we  carefully  selectf'd  the  65  by  65  point  mesh  as 
th<‘  source'  of  the  dramatic  i>erformance  fluctuations  are  particularly  evident  from  the 
str\tct\u<'  of  die  problem. 

In  Figuie  12.  we  also  depict  tlic  [lerfonnance- obtained  on  the  model  problem  when 
self-f'xecuting  synchronization  is  employed.  In  a  great  many  case's,  data  from  all  indices 
in  a  given  wavefront  are  not  actually  recpiired  by  each  imlex  in  the  nc'Xf  wavefront.  M  hen 
self-executing  syncliroiiiza t ion  is  emiihm'd.  a  pipeliiu'  sort  of  ('ffi'ct  may  l)e  ge'iu'rated 
and  we  s<'(' substtmtial  performaiK’e  IreiK'fits.  Pre-sclieduling  on  th<' other  hand,  ajijx'ars 
to  be  much  Ic'ss  robust. 
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5.1.5  Local  v.s.  Global  Index  Set  Scheduling 

We  performed  a  set  of  experiments  to  examine  the  performance  tradeoffs  between  local 
and  global  index  set  scheduling  defined  in  sections  1.  We  used  only  the  self-executing 
loop  structur«'s  in  the  experiments  in  this  section.  Recall  that  when  global  index  set 
scheduling  is  used,  the  index  set  is  sorted  in  increasing  wavefront  order.  The  indc’x  set 
is  then  partitioned  between  processors  in  a  striped  manner.  For  the  local  sorting  method 
is  used,  the  initial  partition  of  indices  is  maintained,  but  their  ordering  is  changed  based 
upon  wavefront  numbers.  In  Table  5  we  present  the  sequential  time  required  to  solve 
each  test  problem,  the  times  required  to  perform  a  sequential  and  a  parallel  version  of 
the  sort  and  the  time  required  to  rearrange  indices  globally.  All  times  in  this  table  are  in 
milliseconds.  \Ve  also  de[)ict  the  time  required  to  perform  local  index  set  scheduling  as 
well  as  the  IG  processor  Multimax/320  timings  obtained  using  these  schedules.  The  time 
required  to  perform  the  sequential  scheduling  is  slightly  lower  than  the  time  needed  for 
performing  a  sequential  iteration.  For  e.xample,  in  the  case  of  SPEo,  the  time  required 
to  perform  the  sequential  sort  plus  the  triangular  solve  adds  up  to  220  ms,  while  a 
completely  sequential  execution  takes  240  ms.  Because  we  pay  for  the  sorting  only 
once,  subsequent  iterations  of  the  code  will  show  a  great  advantage  for  the  parallel 
code  (30  ms  vs.  240  ms  on  IG  processors).  The  time  required  to  produce  a  parallelized 
global  schedule  ranged  from  17  percent  to  G1  percent  of  the  time  needc'd  for  a  sequential 
iteration. 

From  Taljle  5,  we  can  see  that  local  index  set  scheduling  overhead  does  turn  out  to  be 
much  less  than  global  index  set  scheduling  overhead,  as  is  to  be  expected.  However,  as 
far  as  run  times  were  concerned,  local  and  global  scheduling  each  yielded  better  results 
than  the  other  for  some  test  problems.  For  example,  in  the  case  of  SPE2,  global  run 
time  was  21.3  ms  and  local  was  29.6  ms  and  for  SPE3,  global  gave  a  run  time  of  25.1 
while  local  wa.s  22.3  ms. 


6  Conclusions  and  Future  Work 


There  is  a  hi<  rarchy  of  pz'oljlems  with  different  levels  of  scheduling  complexity  that  are  of 
interest  to  researchers  in  the  field  of  parallel  programming.  When  the  data  dependences 
of  the  probii  ni  are  known  at  compile- time,  task  decomposition  can  automatically  be 
l>erfonued  Iiy  the  compiler.  However,  there  are  problems  where  workloads  cannot  be 
fully  charaett'i ized  during  compilation  due  to  data  dependences  that  become  manifest 
at  run-time.  In  [12],  we  piesented  our  initial  results  from  airplying  these  ideas  to  pre- 
schedulable  problems.  In  this  paper,  we  have  extended  the  class  of  problems  that  can  be 
effectively  compiled  by  izarallelizing  compilers.  We  presented  the  doconsider  construct 
which  would  allow  these  compilers  to  effectively  parallelize  such  problems. 

In  this  pajier,  we  havz-  reached  the  conclusion  that  for  the  types  of  workloads  we  have 
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SPEl 


SPE2 


SPE3 


SPE4 


SPE5 


seven.  1 


fivers.  1 


G5-10-1.5 


G5-10-3 


G5-4-1.5 


G5-4-3 


Gomesh 


Table  5:  Global  v.s.  Local  SclK'duliin^ 


Loral  Srciuriitial 

parallel  solve 
sort 


14.9  6.0  50.5 


29.9  29.6  223.2 


56.0  22.3  245.0 


6.0  47.6 


46.0  23.6  240.9 


78.0  54.7  615.7 


100.6  62.6  698.3 


38.8  28.8  192.0 


109.6  66.8  G33.6 


63.1  79.8  7G7.5 


38.8  44.2  394.9 


35.4  44.3  386.1 


82.0  22.8  241.7 


SCCl 

sort 

Global 

parallel 

sort 

solve 

43.9 

31.0 

6.2 

135.1 

48.8 

21.3 

245.0 

135.0 

25.1 

46.1 

29.8 

6.3 

191.3 

100.0 

30.0 

466.5 

203.5 

57.5 

465.0 

153.6 

58.3 

148.7 

72.1 

24.0 

384.8 

173.6 

58.8 

423.5 

131.7 

58.5 

284.4 

106.4 

34.4 

277.4 

101.1 

44.4 

213.0 

149.3  1 

30.6 

investigated,  self-execution  almost  always  performs  better  than  pre-scheduling.  Further, 
the  improvement  in  performance  that  accrues  as  a  result  of  global  topological  sorting  of 
indices  as  opposed  to  the  less  expensive  local  sorting,  is  not  very  significemt  in  the  case 
of  self-execution.  Thus,  we  are  left  with  a  2-dimensional  solution  space,  as  depicted  in 
Figure  1,  which  pictorially  summarizes  the  findings  reported  in  this  paper.  As  regards 
program  transformations  are  concerned,  we  have  shown  how  simple  annotations  might 
be  included  in  parallel  languages  in  order  to  aid  the  compiler  to  create  the  appropriate 
scheduler  and  executor,  given  a  shared  memory  architecture. 
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1.  Appendix  I 

1.1.  Preconditioned  Krylov  Mctliotls  Dar.kground 

We  briefly  present  the  basics  of  Krylov  methods  such  as  are  found  in  PCGPAK. 
Consider  a  large,  sparse,  system  of  linear  equations  of  the  form 


Mx-h  (1.1) 

where  Af  is  a  real  matrix  of  order  N,  b  is  a  given  vector  of  length  N  and  x  is  unknown  vector  to  be 
computed. 

Given  an  initial  guess  Xo,  Krylov  methods  generate  an  approximate  solution  r,  from  tlie  translated 
Krylov  space  xq  +  A',  where 


I<i  C  span{ro,^frQ,...,M'  Vo} 


Xi  is  usually  chosen  to  minimize  some  norm  of  its  residua)  b  —  M  Xj . 

The  basic  tasks  involved  in  Krylov  methods  are  sparse  matrix- vector  multiplies  with  matrix  M,  additions 
of  scalar  multiples  of  vectors  to  other  vectors  (SAXPYs),  and  vector  inner-products.  The  latter  are  used  in 
determining  the  linear  combination  of  Krylov  vectors  to  add  to  the  initial  guess  so  as  to  minimize  (he  norm 
of  the  residual. 

Preconditioned  Krylov  methods  consist  of  using  an  auxiliary  matrix  Q  =  QiQr  to  first  generate  the 
preconditioned  system 

{Qr'MQ;^)Qrx  =  Q7  ^ 


The  matrix  Q  is  chosen  to  be  an  approximation  to  M  for  which  it  is  easy  to  compul  e  Q,  V  and  for 

a  vector  v. 

Approximate  LU  factorization  preconditioners  have  been  found  to  have  very  favorable  convergence 
properties.  Here  we  take  Q  to  be  LU  where  L  is  lower  triangular  and  U  is  upper  triangular.  We  form  L  and 
U  by  a  process  of  incomplete  factorization  in  which  M  is  approximately  factored  in  a  way  that  allows  only 
limited  fill  to  occur. 

The  preconditioned  matrix-vector  multiply  in  the  resulting  Krylov  method  consists  of  doing  a  forward 
and  backward  sparse  triangular  solves  using  L  and  (/  as  well  as  the  sparse  matrix  multiplies  by  The  cost 
of  performing  this  incomplete  factorization  and  the  costs  of  solving  the  resulting  triangular  systems  tends 
to  be  much  smaller  than  the  costs  associated  with  an  exact  factorization  because  of  the  enforced  sparsity  of 

the  matrices  involved.  r  ,  . 

The  computation  in  PCGPAK  is  carried  out  by  (1)  performing  a  symbolic  incomplete  factorization  to 

determine  the  sparsity  structure  of  L  and  U,  (2)  numeric  calculation  of  the  incomplete  factorization  using  the 
previously  calculated  sparsity  structures  and  (3)  matrix  vector  muKiplies,  SAXPYs,  vector  inner  products 
and  sparse  triangular  solves. 


1.2.  The  Test  Problems 

We  now  present  the  eight  test  problems  used  in  our  experimenis. 


Problem  1 
(SPEl) 

Problem  2 
(SPE2) 


Problem  3 
(SPE3) 

Problem  4 
(SPE4) 


This  problem  models  the  pressure  equation  in  a  sequential  black  oil  simulation.  The  grid  is 
10  X  10  X  10  with  one  unknown  per  gridpoint  for  a  total  of  1000  unknowns. 

This  problem  arises  from  the  thermal  simulation  of  a  .steam  injection  process.  1  he  grid  is 
6x6x5  with  6  unknowns  per  grid  point  giving  1080  unknowns.  The  matrix  is  a  block  seven 
point  operator  with  6x6  blocks. 

This  problem  comes  from  an  IMPES  simulation  of  a  black  oil  model  The  matrix  is  a  seven 
point  operator  on  a  35  x  11  x  13  grid  yielding  5005  equations. 

This  problem  also  comes  from  an  IMPES  simulation  of  a  black  oil  mo<lel.  I  he  matrix  is  a 
seven  point  operator  on  a  16  x  23  x  3  grid  giving  1 101  equations. 
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Problem  5  This  problem  arises  from  a  fully-implicit,  simultaneous  solution  simulation  of  a  black  oil 
{SPE5)  model.  It  is  a  block  seven  point  operator  on  a  16  x  23  x  3  grid  with  3x3  blocks  yielding  3312 
equations. 


Problem  6  This  problem  is  a  five  point  central  difference  discretization  of  the  following  equation  on  the 
(5-Pt)  unit  square: 


with  Dirichlet  boundary  conditions  and  /  chosen  so  that  the  exact  solution  is 


u  =  re**'sin(Tx)sin(xy). 


The  discretization  grid  is  63  x  63  giving  3969  unknowns.  The  L5-pt  problem  is  the  same 
problem  with  a  200  x  200  grid. 


Problem  7 
(9-pt) 


This  problem  is  a  nine  point  box  scheme  discretization  for  the  following  equation  on  the  unit 
square: 

with  Dirichlet  boundary  conditions  and  f  chosen  so  that  the  exact  solution  is 


u  =  xc**'8in(xx)sin(iry). 


The  discretization  grid  is  63  x  63  giving  3969  equations.  The  L9-pt  problem  is  the  same 
problem  with  a  127  x  127  grid. 


Problem  8  This  problem  is  a  seven  point  central  difference  discretization  of  the  following  equation  on  the 
(7-pt)  unit  cube; 


+  X  +  y  +  z 


)«  =  / 


with  Dirichlet  boundary  conditions  and  /  chosen  so  that  the  exact  solution  is 


ti  =  (1  -  x)(l  -  y)(l  -  z)(l  -  e-*)(l  -  e-^-Kl  -  c"'). 

The  discretization  grid  is  20  x  20  x  20  yielding  8000  equations.  The  L7-pt  problem  is  the  same 
problem  with  a  30  x  30  x  30  grid. 


2.  Appendix  II:  Parallel  Implementations  of  the  Basic  Krylov  Method 

2.1.  SAXPY  operations,  Vector  inner^products,  and  Sparse  matrix-vector 

The  easily  parallelizable  procedures  in  the  preconditioned  Krylov  methods  implemented  here  are  the 
SAXPY  operations,  the  vector  inner  products  and  the  sparse  matrix-vector  products.  For  p  processors  and 
a  linear  system  of  order  n,  the  indices  from  1  to  n  are  divided  into  p  contiguous  groups  of  roughly  equal  size. 
The  i**  group  is  assigned  to  the  i"*  processor. 

2.2.  Parallel  Triangular  Solves  and  Sparse  Numeric  Factorizations 
2.2.1.  Triangular  Solves 

"fhe  triangular  solve  and  the  sparse  numeric  factorization  can  often  be  efficiently  parallelized  once  the 
matrix  dependent  data  dependencies  are  known.  Refer  to  Figure  8  for  a  description  of  the  triangular  solve 
code. 
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2.2.2.  Sparse  Factorizations 

In  a  straightforward  sequential  version  of  gaussian  elimination  without  pivoting,  consecutive  pirot  rows 
i  are  used  to  eliminate  any  non-zeros  in  column  i  of  all  rows  i  -t-  1  in  N .  All  non-zeros  to  the  lef(  of  row  r’s 
diagonal  are  eliminated  before  a  i  becomes  a  pivot  row.  When  all  non-zeros  to  the  left  of  j’s  diagonal  are 
eliminated,  we  say  that  row  i  has  been  stabtitzri. 

The  elimination  process  tends  to  introduce  new  non-zeros  or  fill  into  the  factored  matrix.  An  approxi¬ 
mate  factorization  can  be  carried  out  by  selectively  suppressing  the  creation  of  many  of  the  non-zeros  created 
during  the  factorization  process.  The  suppression  is  performed  on  (he  basis  of  determining  how  tiidtrrf  t  the 
fill  was.  For  instance,  all  fill  created  by  eliminations  using  the  first  matrix  row  as  a  pivot  row  ari.se  directly 
from  non-zeros  present  in  the  original  matrix.  On  the  other  hand,  when  row  2  is  stabilized,  non  zeros  in 
that  row  may  arise  directly  from  a  non-zero  present  in  the  original  matrix  or  may  arise  as  a  re.sult  from 
fill  from  row  1.  There  are  a  variety  of  methods  used  to  quantify  the  indirectness  of  fill;  only  fill  that  is 
sufficiently  direct  is  retained  and  is  capable  of  generating  further  fill.  The  specifics  of  the  algorithm  used 
here  to  determine  which  elements  are  to  be  retained. 

During  the  course  of  the  computation,  each  row  i  undergoes  a  number  of  transformations  as  non-zero 
elements  in  consecutive  columns  j  <  t  are  eliminated  by  stabilized  pivot  rows  j.  When  all  non  zeros  in 
columns  j  <  i  have  been  eliminated,  row  i  itself  is  stabilized  and  may  be  used  as  a  pivot  row  in  other 
eliminations. 

The  incomplete  factorization  procedure  consists  of  a  symbolic  and  a  numeric  factorization.  The  symbolic 
factorization  calculates  the  non-zero  structure  of  the  factored  matrix,  and  the  numeric  factorization  computes 
the  numeric  values  for  the  incompletely  factored  matrix. 

The  numeric  factorization  is  parallelized  in  a  way  that  is  analogous  to  the  triangular  solve.  Elimination 
in  each  row  i  requires  the  use  of  a  sequence  of  stabilized  pivot  rows  identified  as  before  by  the  sparse  data 
structure  ija.  (figure  13).  In  parallelizing  the  numeric  factorization,  a  topological  sort  of  the  dependen¬ 
cies  pertaining  to  the  outer  loop  indices  is  performed.  As  was  shown  explicitly  for  the  triangular  solve, 
prescheduled  and  self-executing  versions  of  the  numeric  factorization  algorithm  can  be  formulated. 


SI  do  i=l,n 

do  j=ija(i),ija(i-|-l)-l 

Use  pivot  row  ija(j)  to  perform  elimination  on  row  i 
end  do 

end  do 


Figure  13;  Schematic  Si)arse  Factorization 


2.3.  Sparse  Symbolic  Factorizations 

Because  the  pattern  of  fill  is  not  known,  the  data  dependencies  in  symbolic  factorization  .nniiot  be 
analyzed  before  the  algorithm  executes.  In  our  implementation  of  the  algorithm,  we  distribute  the  rows  of 
the  matrix  over  processors  in  a  wrapped  manner  and  execute  in  a  self-scheduled  fashion 

Since  we  are  dealing  with  incomplete  factorization  of  sparse  matrices,  the  fill  pattern  will  be  sparse. 
The  columns  of  row  i  that  are  filled  in  at  any  given  stage  of  the  algorithm  are  kept  sorted  in  increasing  order 
in  a  linked  list.  Operations  on  row  j  with  pivot  row  j  require  that  the  list  of  non-zeros  pertaining  'o  row 
i  be  merged  with  the  list  of  non-zeros  pertaining  to  pivot  row  j.  Note  that  because  this  is  an  incomplete 
factorization,  some  of  the  non-zero  elements  in  the  newly  created  merged  list  are  omitted. 
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